Tutorial restructure draft #44
Conversation
Sorry in advance if the comments are too drastic! Let's discuss if there's a way to address them. I won't block this refactoring over them.
I loved how everything is well structured and goes step-by-step! Such an improvement (at least for me 😅)
Minor comment, just thinking out loud: for Tut 1, what is the reason we use Elasticsearch (as opposed to InMemory)? Is it to start off with a good default? The "QA System" -> "QA System without ES" rather than "QA System (InMemory)" -> "QA System with ES" has always felt a bit odd to me. 😄 Just a hunch: someone not familiar with ES might find all the ES setup intimidating (albeit it's just running the given code/commands).
I actually agree with you on this one! I think ES is a complicating factor this early on, and they aren't working at a scale where they'd get the benefits of an ES server over an InMemoryDS. One thing though: InMemoryDS will require using the TfidfRetriever instead of the BM25Retriever, which is not a recommended default.
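For concreteness, a sketch of the two setups being weighed here (assuming the Haystack v1 class names current at the time; not code from the tutorial itself):

```python
from haystack.document_stores import ElasticsearchDocumentStore, InMemoryDocumentStore
from haystack.nodes import BM25Retriever, TfidfRetriever

# Option A: needs a running Elasticsearch server, but enables the recommended BM25Retriever
es_store = ElasticsearchDocumentStore(host="localhost", port=9200)
bm25_retriever = BM25Retriever(document_store=es_store)

# Option B: no external services, but (at this point) limited to TF-IDF retrieval
in_memory_store = InMemoryDocumentStore()
tfidf_retriever = TfidfRetriever(document_store=in_memory_store)
```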
We could also consider adding support for BM25 in InMemory in the near future if there's a quick and easy way to do that. There might be libraries for it, so we wouldn't need to implement BM25 from scratch (for example https://pypi.org/project/rank-bm25/).
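A quick feasibility check of that idea, using the rank-bm25 package linked above (the toy corpus and whitespace tokenization are just for illustration):

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

corpus = [
    "Arya Stark is the younger daughter of Ned Stark",
    "Haystack lets you build question answering pipelines",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query = "who is arya stark".split()
print(bm25.get_scores(query))  # one BM25 relevance score per document
```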
To give a general opinion, I really appreciate this restructuring of the tutorials: it seems to me that it lowers the entry barrier a little, without selling overly pre-packaged solutions. 💪 Just two small comments on the fine-tuning tutorial:
Thanks @anakin87 for the heads up! Having BM25 in the first tutorial working with the InMemoryDocumentStore would be great.
Thank you very much for taking the time to have a look at the tutorials! Your feedback is definitely really helpful, and I'm glad you picked up on this point. Further training the QA model on SQuAD dev is definitely not ideal because you can no longer evaluate it on SQuAD. In theory it should be generally better just because it has seen more data, but I will ask internally whether we have any more appropriate datasets to work with in this tutorial.
@brandenchan deepset-ai/haystack#3561 is merged, so we can now use BM25 with the InMemoryDocumentStore.
@brandenchan to build the BM25 representation, the document store must be initialized with BM25 enabled, as in the sketch below; the representation is not computed by default.
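A minimal sketch of that initialization, assuming the use_bm25 flag introduced by deepset-ai/haystack#3561:

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever

# use_bm25=True builds the BM25 representation at indexing time (it is off by default)
document_store = InMemoryDocumentStore(use_bm25=True)
retriever = BM25Retriever(document_store=document_store)
```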
Thanks for the heads up! Committing that change now.
@brandenchan - the slug structure is now ready for you to use, including the auto-generated download buttons. You can merge master and we can take it from there.
As I will be on PTO next week, I'm adding this note here just in case it is needed. @julian-risch tagging you here too, as this is about telemetry tracking and the files in the new S3 bucket as well. FYI: we've spoken about this with docs, just putting it in writing.
And with that, a suggestion:
index.toml (Outdated)

```toml
aliases = ["without-elasticsearch"]
slug = "03_scalable_qa_pipeline"
```
So now that we no longer have (and no longer want to maintain) "03_build_a_scalable_question_answering_system": where do you want people looking for it to go? Should they be directed to the 1st tutorial, or to this one? We can add the old slug to the aliases of either accordingly.
the first tutorial, I'd say
@brandenchan here are some preview links for you. Please disregard the URL structure in these previews. The slugs in this PR are correct; I just had to make the previews like this because I'm hacking it :) We don't have a proper previewing structure in place yet.
Judging by the preview, BM25 doesn't appear in Tutorial 1.
I think it was mistakenly updated in the .md file instead of the notebook file. Fixed it now. Thanks for being alert! :)
doc_dir = "data/build_a_scalable_question_answering_system" | ||
|
||
fetch_archive_from_http( | ||
url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt1.zip", |
This has to change to _txt3.zip
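In other words, the fixed cell would look something like this (the output_dir argument is assumed to match the doc_dir above):

```python
from haystack.utils import fetch_archive_from_http

doc_dir = "data/build_a_scalable_question_answering_system"
fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt3.zip",
    output_dir=doc_dir,
)
```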
doc_dir = "data/distil_a_reader" | ||
squad_dir = doc_dir + "/squad" | ||
|
||
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip" |
Can this dataset be moved from tutorial 2 to 21 for telemetry, @julian-risch?
Yes, that can be done. To make this process less error-prone, maybe we can have a list of the old tutorials' dataset URLs and what they become? This is not the only one that needs to change, and it would be great if we could do it in one larger batch if possible.
```python
# Downloading very small dataset to make tutorial faster (please use a bigger dataset for real use cases)
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small.json.zip"
```
At this point I am baffled and I think I need help figuring this out, @julian-risch.
Right now, we look at this dataset to count the times tutorial 2 has been run. Now we're decoupling the distillation bit, so this line is out. However, at no point in the training step do we have users download the data/squad20 data with train_filename="dev-v2.0.json", yet the tutorial works. Does the train() function pull it by name from somewhere? And once we merge these tutorial restructures, is it possible to track that? @agnieszka-m tagging you here for visibility.
That can't work. The only possible explanation is that you already have the data_dir = "data/squad20" directory with some data in it, for example because you ran another tutorial before. We need to fetch_archive_from_http the data in tutorial 2 as well, so this needs to be added here again. And I could upload a datasets/documents/squad_small2.json.zip, or something like that with a different dataset name than before, for telemetry.
Ah, I'm wrong.
I just tried running the new tutorial here: https://colab.research.google.com/github/deepset-ai/haystack-tutorials/blob/tutorial_restructure_draft/tutorials/02_finetune_a_reader.ipynb and it does download the dataset.
The explanation is that the processor has two datasets hardcoded here: https://github.com/deepset-ai/haystack/blob/e266cf6e29f78df751d9dbe7a505886579233aa5/haystack/modeling/data_handler/processor.py#L42 One of them is squad.
Here is the logic that decides whether to download that data: https://github.com/deepset-ai/haystack/blob/e266cf6e29f78df751d9dbe7a505886579233aa5/haystack/modeling/data_handler/processor.py#L2250
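Roughly, the linked logic amounts to this (a paraphrased sketch with approximate identifiers and a placeholder URL, not the actual Haystack source):

```python
import os
import tarfile
import urllib.request

# The processor hardcodes download URLs for a couple of datasets, keyed by name.
# The mapping and URL below are placeholders, not the real Haystack values.
DOWNSTREAM_TASK_MAP = {
    "squad20": "https://<some-s3-bucket>/squad20.tar.gz",  # placeholder URL
}

def download_and_extract(url: str, target_dir: str) -> None:
    """Fetch a tar archive and unpack it into target_dir."""
    archive_path, _ = urllib.request.urlretrieve(url)
    with tarfile.open(archive_path) as archive:
        archive.extractall(target_dir)

def ensure_downstream_data(data_dir: str, filename: str) -> str:
    """If the expected file is missing and the dataset name is known, download it first."""
    path = os.path.join(data_dir, filename)
    task_name = os.path.basename(os.path.normpath(data_dir))  # e.g. "squad20"
    if not os.path.exists(path) and task_name in DOWNSTREAM_TASK_MAP:
        download_and_extract(DOWNSTREAM_TASK_MAP[task_name], data_dir)
    return path
```

That would explain why train(data_dir="data/squad20", train_filename="dev-v2.0.json") works even without an explicit fetch in the notebook.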
Thanks. In that case, for now, until we decide whether to change or keep the way we do telemetry for tutorials, I will add a fetch_archive_from_http call to tutorial 2 as well, as you suggested in the comment above. In the meantime I will create a list of datasets for tutorials; whether or not there is a change, it will be healthier to have.
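Concretely, something like this would go into tutorial 2 (the doc_dir name is a guess, and squad_small2.json.zip is the not-yet-uploaded dataset suggested above):

```python
from haystack.utils import fetch_archive_from_http

doc_dir = "data/finetune_a_reader"  # hypothetical directory name for the restructured tutorial
fetch_archive_from_http(
    url="https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/squad_small2.json.zip",
    output_dir=doc_dir,
)
```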
I've already decoupled tutorials 1 and 3 into a separate PR, as there's no need for those two to wait while we're fixing datasets etc. for Tutorials 2 and 21.
Creating a draft of Tutorial 1 in a new tutorial style. To summarize the main design changes: